Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add resplitting functionality to Flower Datasets #2427

Merged
merged 25 commits into from
Oct 16, 2023

Conversation

adam-narozniak
Copy link
Contributor

@adam-narozniak adam-narozniak commented Sep 27, 2023

Issue

The datasets downloaded from Hugging Face come with certain splits. Yet, users might want to use different divisions of the whole dataset, which is currently not possible.

Description

fds = FederatedDataset(dataset="mnist", partitioners={"train": 10})
# then I can
fds.load_partition(10, "train")

But what if the dataset had three splits "train", "valid", "test" (not 2 like "mnist").
In that case, you might want to have a single dataset from which 10 partitions are created.

And the reverse might hold true for just a single-split dataset. Sometimes datasets have just train split and need to create a centralized dataset. This is currently impossible.

Proposal

Enable the missing functionality described above (and add tests to make sure they are met).

  • Introduce the resplitter keyword in the FederatedDataset
  • Enable two major ways of creation of the new dataset (with different splits)
  • The first one Callable[[DatasetDict], DatasetDict]] You might perform as sophisticated a change as you wish - just provide this as resplitter to the FederatedDataset (Note: all the checks to use that correctly are on the user side)
  • The second option is Dict[Tuple[str, ...], str] (called for convenience ResplitStrategy). Here is an example {("train", "valid"): "bigger_train"}. We create a "bigger_train" split from the "train" and "valid" splits. From this object a Resplitter (newly introduced class) is created. That is essentially Callable[[DatasetDict], DatasetDict]] with additional check if the splits are used correctly (you can use only the existing splits, and you cannot create a new dataset that has two splits with the same name)
    This works as follows:
    First option
  def resplit(dataset: DatasetDict) -> DatasetDict:
      return DatasetDict(
          {
              "bigger_train": concatenate_datasets(
                  [dataset["train"], dataset["valid]]
              ),
              "test": dataset["test"]
          }
      )

  fds = FederatedDataset(
      dataset=self.dataset_name, resplitter=resplit, partitioners={"bigger_train": 100}
  )
  bigger_train = fds.load_full("bigger_train")
  # Or just load a partition
  partition_from_bigger = fds.load_partition("bigger_train": 100)

The second option (that does the same thing) ResplitterStrategy specification => internally Resplitter creation

fds = FederatedDataset(
            dataset=self.dataset_name, resplitter={("train", "valid"): "bigger_train"}, partitioners={"bigger_train": 100}
     )
# if I made a mistake using here 
# e.g. resplitter = {("train", "split-that-does- not-exist"): "bigger_train"}
# I'll get a meaningful error
# similarly if I did sth like that 
# resplitter = {("train",): "new", ("valid", ): "new"}
# there can't be two splits named "new" so I'll get an error
bigger_train = fds.load_full("bigger_train")
# Or just load a partition
partition_from_bigger = fds.load_partition("bigger_train": 100)

datasets/flwr_datasets/federated_dataset.py Outdated Show resolved Hide resolved
datasets/flwr_datasets/federated_dataset.py Outdated Show resolved Hide resolved
datasets/flwr_datasets/merge_splitter.py Outdated Show resolved Hide resolved
datasets/flwr_datasets/merge_splitter.py Outdated Show resolved Hide resolved
@adam-narozniak
Copy link
Contributor Author

Maybe the MergeResplitter should be simply Merger or SplitMerger? WDYT?

@adam-narozniak
Copy link
Contributor Author

Also, one more thing. I don't think the Dict[Tuple[str, ...], str] (the Tuple as the key) is that common. It matches more naturally the flow of operation: FROM (some keywords that already exist) -----create---> TO (new keywords). WDYT? We might change that.

Copy link
Contributor

@jafermarq jafermarq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool functionality. I just left a single comment.

datasets/flwr_datasets/merge_resplitter.py Show resolved Hide resolved
@jafermarq
Copy link
Contributor

Also, one more thing. I don't think the Dict[Tuple[str, ...], str] (the Tuple as the key) is that common. It matches more naturally the flow of operation: FROM (some keywords that already exist) -----create---> TO (new keywords). WDYT? We might change that.

This felt fine when I wast testing the splitter functionality.

datasets/flwr_datasets/federated_dataset.py Outdated Show resolved Hide resolved
datasets/flwr_datasets/federated_dataset.py Outdated Show resolved Hide resolved
datasets/flwr_datasets/merge_resplitter.py Outdated Show resolved Hide resolved
datasets/flwr_datasets/merge_resplitter.py Show resolved Hide resolved
datasets/flwr_datasets/merge_resplitter_test.py Outdated Show resolved Hide resolved
@adam-narozniak
Copy link
Contributor Author

I've switched the keys and values. I also fixed the resplit_dataset_if_needed (I violated the single responsibility principle there and it had complex initialization and then resplit functionality, inti now moved to utils; that also created circular import of Resplitter so I created a common.typing which of which the name can be changed but typing alone didn't work so if common.typing is not ok, then maybe types.py?)

datasets/flwr_datasets/common/__init__.py Outdated Show resolved Hide resolved
datasets/flwr_datasets/common/typing.py Outdated Show resolved Hide resolved
@danieljanes danieljanes enabled auto-merge (squash) October 16, 2023 12:13
@danieljanes danieljanes merged commit 7f8a7e2 into main Oct 16, 2023
29 checks passed
@danieljanes danieljanes deleted the fds-add-resplitting-functionality branch October 16, 2023 13:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants